Part I - Exploratory Data Analysis of the Ford GoBike February 2019 Trip Data

by Moses Ojonuba

Introduction

Ford GoBike is a regional public bicycle sharing system in the San Francisco Bay Area, California.

Ford GoBike, like other bike share systems, consists of a fleet of specially designed, sturdy and durable bikes that are locked into a network of docking stations throughout the city. The bikes can be unlocked from one station and returned to any other station in the system, making them ideal for one-way trips. The bikes are available for use 24 hours/day, 7 days/week, 365 days/year and riders have access to all bikes in the network when they become a member or purchase a pass.

This dataset includes information about individual rides made in the Ford GoBike bike-sharing system, covering the greater San Francisco Bay Area.

Dataset Dictionary:

  • duration_sec: Trip Duration (seconds)
  • start_time: Start Time and Date
  • end_time: End Time and Date
  • start_station_id: Start Station ID
  • start_station_name: Start Station Name
  • start_station_latitude: Start Station Latitude
  • start_station_longitude: Start Station Longitude
  • end_station_id: End Station ID
  • end_station_name: End Station Name
  • end_station_latitude: End Station Latitude
  • end_station_longitude: End Station Longitude
  • bike_id: Bike ID
  • user_type: User Type (“Subscriber” = Member, “Customer” = Casual)
  • member_birth_year: Member Year of Birth
  • member_gender: Member Gender
  • bike_share_for_all_trip: Boolean to track members who are enrolled in the "Bike Share for All" program for low-income residents

Preliminary Wrangling

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
!pip install plotly==5.9.0 --quiet
import plotly.express as px
%matplotlib inline
In [2]:
sns.set_style('darkgrid')
plt.rcParams['font.size'] = 12
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['figure.facecolor'] = '#00000000'
In [3]:
#loading data
rides_df = pd.read_csv('201902-fordgobike-tripdata.csv')

rides_df.head()
Out[3]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip
0 52185 2019-02-28 17:32:10.1450 2019-03-01 08:01:55.9750 21.0 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 13.0 Commercial St at Montgomery St 37.794231 -122.402923 4902 Customer 1984.0 Male No
1 42521 2019-02-28 18:53:21.7890 2019-03-01 06:42:03.0560 23.0 The Embarcadero at Steuart St 37.791464 -122.391034 81.0 Berry St at 4th St 37.775880 -122.393170 2535 Customer NaN NaN No
2 61854 2019-02-28 12:13:13.2180 2019-03-01 05:24:08.1460 86.0 Market St at Dolores St 37.769305 -122.426826 3.0 Powell St BART Station (Market St at 4th St) 37.786375 -122.404904 5905 Customer 1972.0 Male No
3 36490 2019-02-28 17:54:26.0100 2019-03-01 04:02:36.8420 375.0 Grove St at Masonic Ave 37.774836 -122.446546 70.0 Central Ave at Fell St 37.773311 -122.444293 6638 Subscriber 1989.0 Other No
4 1585 2019-02-28 23:54:18.5490 2019-03-01 00:20:44.0740 7.0 Frank H Ogawa Plaza 37.804562 -122.271738 222.0 10th Ave at E 15th St 37.792714 -122.248780 4898 Subscriber 1974.0 Male Yes
In [4]:
rides_df.shape
Out[4]:
(183412, 16)
In [5]:
rides_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 16 columns):
duration_sec               183412 non-null int64
start_time                 183412 non-null object
end_time                   183412 non-null object
start_station_id           183215 non-null float64
start_station_name         183215 non-null object
start_station_latitude     183412 non-null float64
start_station_longitude    183412 non-null float64
end_station_id             183215 non-null float64
end_station_name           183215 non-null object
end_station_latitude       183412 non-null float64
end_station_longitude      183412 non-null float64
bike_id                    183412 non-null int64
user_type                  183412 non-null object
member_birth_year          175147 non-null float64
member_gender              175147 non-null object
bike_share_for_all_trip    183412 non-null object
dtypes: float64(7), int64(2), object(7)
memory usage: 22.4+ MB
In [6]:
rides_df.isna().sum().sort_values()
Out[6]:
duration_sec                  0
start_time                    0
end_time                      0
start_station_latitude        0
start_station_longitude       0
end_station_latitude          0
end_station_longitude         0
bike_id                       0
user_type                     0
bike_share_for_all_trip       0
start_station_id            197
start_station_name          197
end_station_id              197
end_station_name            197
member_birth_year          8265
member_gender              8265
dtype: int64
In [7]:
rides_df.duplicated().sum()
Out[7]:
0
In [8]:
rides_df.user_type.unique()
Out[8]:
array(['Customer', 'Subscriber'], dtype=object)
In [9]:
rides_df.member_gender.unique()
Out[9]:
array(['Male', nan, 'Other', 'Female'], dtype=object)
In [10]:
rides_df.bike_share_for_all_trip.unique()
Out[10]:
array(['No', 'Yes'], dtype=object)

Following the assessment, we drop rows with missing values and convert the data types.

In [11]:
rides_df.dropna(inplace=True)
In [12]:
#convert to string
rides_df[['start_station_id', 'end_station_id', 'bike_id']] = rides_df[['start_station_id', 'end_station_id', 'bike_id']].astype(str)


rides_df['member_birth_year'] =rides_df['member_birth_year'].astype(int)

rides_df[['start_time', 'end_time']] = rides_df[['start_time', 'end_time']].apply(pd.to_datetime)
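
One side effect of the string cast above is worth noting: the station IDs were loaded as floats, so converting them straight to strings keeps the trailing ".0" (for example '21.0'). The sketch below shows one possible way around this on a small made-up frame, not the real data; the name toy is an assumption for illustration only.

```python
import pandas as pd

# Toy stand-in for the station id column (values are illustrative only)
toy = pd.DataFrame({'start_station_id': [21.0, 23.0, 375.0]})

# A direct cast keeps the float formatting
direct = toy['start_station_id'].astype(str)

# Casting to int first gives clean identifiers (safe only after missing values are dropped)
clean = toy['start_station_id'].astype(int).astype(str)

print(direct.tolist())
print(clean.tolist())
```

Whether the cleaner IDs matter depends on how the column is used later; in this notebook the station names, not the IDs, drive the analysis.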
In [13]:
rides_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 16 columns):
duration_sec               174952 non-null int64
start_time                 174952 non-null datetime64[ns]
end_time                   174952 non-null datetime64[ns]
start_station_id           174952 non-null object
start_station_name         174952 non-null object
start_station_latitude     174952 non-null float64
start_station_longitude    174952 non-null float64
end_station_id             174952 non-null object
end_station_name           174952 non-null object
end_station_latitude       174952 non-null float64
end_station_longitude      174952 non-null float64
bike_id                    174952 non-null object
user_type                  174952 non-null object
member_birth_year          174952 non-null int64
member_gender              174952 non-null object
bike_share_for_all_trip    174952 non-null object
dtypes: datetime64[ns](2), float64(4), int64(2), object(8)
memory usage: 22.7+ MB

What is the structure of your dataset?

The raw dataset has 183,412 rows and 16 columns. After dropping rows with missing values, 174,952 rows remain.

What is/are the main feature(s) of interest in your dataset?

Average age of riders, average duration of trips, gender distribution of riders, and distribution of user type.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

The birth year, duration, gender and user type variables.

Univariate Exploration

Q. What is the distribution of age?
In [14]:
#create an age column
rides_df['age'] = 2019 - rides_df['member_birth_year']
In [15]:
rides_df.head(3)
Out[15]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip age
0 52185 2019-02-28 17:32:10.145 2019-03-01 08:01:55.975 21.0 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 13.0 Commercial St at Montgomery St 37.794231 -122.402923 4902 Customer 1984 Male No 35
2 61854 2019-02-28 12:13:13.218 2019-03-01 05:24:08.146 86.0 Market St at Dolores St 37.769305 -122.426826 3.0 Powell St BART Station (Market St at 4th St) 37.786375 -122.404904 5905 Customer 1972 Male No 47
3 36490 2019-02-28 17:54:26.010 2019-03-01 04:02:36.842 375.0 Grove St at Masonic Ave 37.774836 -122.446546 70.0 Central Ave at Fell St 37.773311 -122.444293 6638 Subscriber 1989 Other No 30
In [16]:
rides_df['age'].describe()
Out[16]:
count    174952.000000
mean         34.196865
std          10.118731
min          18.000000
25%          27.000000
50%          32.000000
75%          39.000000
max         141.000000
Name: age, dtype: float64

Our age column seems to have quite a number of outliers, given that 75% of the riders are 39 or younger. The maximum age is 141, which is likely an error. Let us visualize it for a clearer picture.

In [17]:
plt.boxplot(rides_df['age'], vert=False)
plt.xlabel('Age')
plt.title('Distribution of Age');
Observation

As the boxplot above shows, there are quite a number of outliers, with many riders in the senior age range. This is plausible, since seniors are encouraged to ride bikes as a form of exercise, so we treat those points as legitimate data. However, we doubt that anyone rode a bike at age 141, or even at age 119. There is no record of anyone being alive at 141 as of 2019, and even if such a person existed, it would certainly be risky to let them ride a bike.

To deal with this, we cap age at 100 for this analysis.
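
For comparison, the boxplot's own outlier rule can be sketched as follows. With the quartiles reported above (27 and 39), the upper fence sits at 39 + 1.5 × 12 = 57, which would also discard the senior riders we want to keep; that is why a fixed cutoff is used instead. The ages below are invented for illustration, and the variable names are assumptions.

```python
import pandas as pd

# Illustrative ages only; 119 and 141 mimic the implausible values seen above
ages = pd.Series([18, 25, 27, 30, 32, 35, 39, 45, 60, 75, 119, 141])

q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # the default whisker rule in matplotlib's boxplot

flagged = ages[ages > upper_fence]  # points a boxplot would draw as outliers
kept = ages[ages <= 100]            # the fixed cutoff chosen for this analysis

print(upper_fence)
print(flagged.tolist())
```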

In [18]:
high = rides_df["age"] < 101

rides = rides_df[high]

rides.age.describe()
Out[18]:
count    174880.000000
mean         34.162043
std           9.974001
min          18.000000
25%          27.000000
50%          32.000000
75%          39.000000
max          99.000000
Name: age, dtype: float64
In [19]:
rides['age'].hist(bins= 20);
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age Distribution');

From the histogram above, we can see that most riders are between 25 and 35 years old.

Q. What is the distribution of trip duration?
In [20]:
rides['duration_sec'].describe()
Out[20]:
count    174880.000000
mean        704.022358
std        1642.514884
min          61.000000
25%         323.000000
50%         510.000000
75%         789.000000
max       84548.000000
Name: duration_sec, dtype: float64
In [21]:
rides['duration_sec'].hist(bins=500)
plt.xlabel('Duration[Sec]')
plt.title('Distribution of Trip Duration');
Observation

The trip duration histogram is highly right-skewed because of a few very long trips. For this reason, we use the median when answering other questions about duration, since, unlike the mean, it is barely affected by outliers.
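
The summary above already hints at this: the mean (704 seconds) sits well above the median (510 seconds). A minimal sketch of the effect, using made-up durations rather than the real data:

```python
import numpy as np

# Made-up durations in seconds: mostly short trips plus one day-long rental
durations = np.array([300, 400, 500, 600, 700, 84000])

mean_sec = durations.mean()
median_sec = np.median(durations)

# Dropping the single extreme value moves the mean a lot and the median only a little
trimmed = durations[:-1]
trimmed_mean = trimmed.mean()
trimmed_median = np.median(trimmed)

print(mean_sec, median_sec)
print(trimmed_mean, trimmed_median)
```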

Q. What is the distribution of gender?
In [22]:
fig= px.pie(rides, names='member_gender', width=600, height=300, title='Distribution of Gender')

fig.show()
Observation

A large percentage of the riders are male (74%), about three times the share of female riders (23%).

This matches an investigation by Elizabeth Plank on the bike paths of New York City, which found that far more men than women ride bikes: “In the U.S., 1 woman for every 3 men gets around on a bicycle”.

According to Plank, “In London, 77% of bike trips are taken by men and only 5% of women identify as frequent cyclists.”

https://slate.com/human-interest/2014/09/gender-gap-alert-men-ride-bikes-way-more-than-women-do.html
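
The gender shares quoted above are read off the pie chart. They could also be computed directly; the sketch below shows the idea on an invented column, so the printed shares are not the real ones.

```python
import pandas as pd

# Toy gender column; the proportions are invented for illustration
gender = pd.Series(['Male'] * 15 + ['Female'] * 4 + ['Other'] * 1)

# normalize=True returns fractions instead of counts
shares = gender.value_counts(normalize=True).mul(100).round(1)

print(shares)
```

On the real data the same call would be applied to rides['member_gender'].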

Q. What is the distribution of user type?
In [23]:
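#define a function to plot a categorical feature
def plot_cat(var, l=8, b=5):
    plt.figure(figsize=(l, b))
    # keyword arguments work across seaborn versions; newer releases treat the first positional argument as data
    sns.countplot(data=rides, x=var, order=rides[var].value_counts().index)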
In [24]:
#call function to plot countplot
plot_cat('user_type')
Observation

Over 90% of the riders in our dataset are subscribers, meaning they pay a subscription fee, which could be monthly or yearly. Only a small percentage of the riders are customers, meaning they pay per trip at the station or kiosk.

Q. What percentage of riders are part of the Bike Share for All program?
In [25]:
plot_cat('bike_share_for_all_trip')
Observation

A large percentage of riders are not part of the program. We do not have enough information to know the reason for this.

Bivariate Exploration

Q. What is the median trip duration for each gender?
In [26]:
rides.groupby('member_gender')['duration_sec'].median()
Out[26]:
member_gender
Female    567.0
Male      493.0
Other     555.5
Name: duration_sec, dtype: float64
In [27]:
rides.groupby('member_gender')['duration_sec'].median().sort_values(ascending=False).plot(kind='bar')
plt.xlabel('Gender')
plt.ylabel('Duration[sec]')
plt.title('Median Duration of Trips by Gender');
Observation

Female riders take longer trips (median 567 seconds, or approximately 9.5 minutes) than male riders (493 seconds, or about 8 minutes), though the difference is not large.

Q. What is the median trip duration for each user type?
In [28]:
rides.groupby('user_type')['duration_sec'].median()
Out[28]:
user_type
Customer      780
Subscriber    490
Name: duration_sec, dtype: int64
In [29]:
rides.groupby('user_type')['duration_sec'].median().plot(kind='bar')
plt.xlabel('User Type')
plt.ylabel('Duration[Sec]')
plt.title('Median Duration of Trips by User Type');
Observation

Customers take longer trips (median 780 seconds, or 13 minutes) than subscribers (490 seconds, or about 8 minutes).
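
The conversion to minutes can be checked with a short calculation. The two medians are copied from the output above; the variable names are assumptions for illustration.

```python
import pandas as pd

# Medians copied from the output above (seconds)
median_sec = pd.Series({'Customer': 780, 'Subscriber': 490})

# Convert to minutes and compare the two groups
median_min = (median_sec / 60).round(1)
ratio = median_sec['Customer'] / median_sec['Subscriber']

print(median_min)
print(round(ratio, 2))
```

The ratio gives a compact way to state the gap: the typical customer trip is roughly one and a half times as long as the typical subscriber trip.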

Q. In which week of February did riders take longer trips?
In [30]:
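# ISO week of the year; Series.dt.week was removed in pandas 2.0 (isocalendar() needs pandas >= 1.1)
rides_df['ride_start_week'] = rides_df['start_time'].dt.isocalendar().week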
rides_df.groupby('ride_start_week')['duration_sec'].median()
Out[30]:
ride_start_week
5    494
6    498
7    512
8    532
9    498
Name: duration_sec, dtype: int64
In [31]:
rides_df.groupby('ride_start_week')['duration_sec'].median().sort_values().plot(kind='barh')
plt.xlabel('Duration[Sec]')
plt.ylabel('Week of the Year')
plt.title('Median Trip Duration per Week of the Year');
Observation

Riders took the longest trips in week 8 of the year, the fourth of the five calendar weeks in the data. The median trip duration for that week was 532 seconds, or approximately 9 minutes, though it differs little from the other weeks.
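
Note that the week numbers above (5 to 9) are ISO weeks of the year, not weeks of the month. The sketch below contrasts the two on a few February 2019 dates. The week_of_month formula is one hypothetical definition (days 1 to 7 are week 1, days 8 to 14 are week 2, and so on), not something the dataset or pandas defines.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(
    ['2019-02-01', '2019-02-04', '2019-02-18', '2019-02-28']
))

# Week of the year, as used in the table above (requires pandas >= 1.1)
iso_week = dates.dt.isocalendar().week

# Hypothetical week-of-month definition based only on the day number
week_of_month = (dates.dt.day - 1) // 7 + 1

print(iso_week.tolist())
print(week_of_month.tolist())
```

Under this day-number definition, 18 February falls in week 3 of the month, even though ISO week 8 is the fourth calendar week that overlaps February 2019, so the label depends on the definition chosen.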

Q. On which day of the week did riders take longer trips?
In [32]:
# Add a column for the weekday of the start of the ride
rides_df['ride_start_weekday'] = rides_df['start_time'].dt.day_name()

# Print the median trip time per weekday
print(rides_df.groupby('ride_start_weekday')['duration_sec'].median())
ride_start_weekday
Friday       511.0
Monday       503.0
Saturday     539.0
Sunday       534.0
Thursday     512.0
Tuesday      502.5
Wednesday    503.0
Name: duration_sec, dtype: float64
In [33]:
rides_df.groupby('ride_start_weekday')['duration_sec'].median().sort_values(ascending=False).plot(kind='bar')
plt.xlabel('Week Day')
plt.ylabel('Duration[Sec]')
plt.title('Median Trip Duration by Day of the Week');
Observation

Riders took longer trips on weekends (Saturday and Sunday).
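
The groupby output above is sorted alphabetically, which makes the weekend pattern harder to spot. A small sketch of putting the days in calendar order, using the medians copied from that output; the names calendar_order and is_weekend are assumptions for illustration.

```python
import pandas as pd

# Medians copied from the output above (seconds)
weekday_median = pd.Series({
    'Friday': 511.0, 'Monday': 503.0, 'Saturday': 539.0, 'Sunday': 534.0,
    'Thursday': 512.0, 'Tuesday': 502.5, 'Wednesday': 503.0,
})

calendar_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                  'Friday', 'Saturday', 'Sunday']
ordered = weekday_median.reindex(calendar_order)

# Rough comparison only: these are averages of daily medians, not medians of the pooled trips
is_weekend = ordered.index.isin(['Saturday', 'Sunday'])
weekend_avg = ordered[is_weekend].mean()
weekday_avg = ordered[~is_weekend].mean()

print(ordered)
print(weekend_avg, weekday_avg)
```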

Q. Is there a relationship between age and duration?
In [34]:
fig = px.scatter(rides_df, x='age', y='duration_sec', title='Duration vs Age')
fig.show()
Observation

There is no clear linear relationship between age and trip duration. However, most people who took longer trips were between 25 and 45 years old.
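
The scatter plot supports this visually. One way to put a number on it is the Pearson correlation coefficient, sketched here on invented values where duration rises and then falls with age; this is not the value for the real data.

```python
import pandas as pd

# Made-up values: there is a pattern, but it is not a linear one
toy = pd.DataFrame({
    'age': [20, 30, 40, 50, 60],
    'duration_sec': [500, 700, 900, 700, 500],
})

# Series.corr defaults to the Pearson coefficient, which only measures linear association
r = toy['age'].corr(toy['duration_sec'])

print(abs(round(r, 3)))
```

A coefficient near zero rules out a straight-line trend but not other kinds of dependence, which is why the plot is still worth reading alongside it.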

Q. What is the distribution of the station locations across San Francisco?
In [35]:
fig = px.scatter_mapbox(
    rides_df,  # Our DataFrame
    lat='start_station_latitude',
    lon='start_station_longitude',
    center={"lat": 37.773972, "lon": -122.431297},  # Map will be centered on San Francisco
    width=600,  # Width of map
    height=600,  # Height of map
    hover_data=['start_station_name'],  # Display Station name when hovering mouse over station
    title = 'Distribution of Stations'
)

fig.update_layout(mapbox_style="open-street-map")

fig.show()
Observation

The bike stations fall into three clusters.
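
The map above draws one point per trip, so every station is plotted many times over. A lighter alternative, sketched here on three toy rows taken from the sample shown earlier, is to reduce the frame to one row per station before plotting. The names trips and stations are assumptions for illustration.

```python
import pandas as pd

# Three toy trips; the first two start at the same station
trips = pd.DataFrame({
    'start_station_name': ['Frank H Ogawa Plaza', 'Frank H Ogawa Plaza',
                           'Market St at Dolores St'],
    'start_station_latitude': [37.804562, 37.804562, 37.769305],
    'start_station_longitude': [-122.271738, -122.271738, -122.426826],
})

# Keep the first row seen for each station name
stations = trips.drop_duplicates(subset='start_station_name').reset_index(drop=True)

print(stations)
```

The resulting frame could then be passed to the mapping call in place of the full trip table.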

Multivariate Exploration

Some riders take a bike from a station and return it to the same station. Let us call such trips round trips.

In [36]:
# Create round trips
trips = (rides['start_station_name'] == rides['end_station_name'])
In [37]:
round_trips = rides[trips]
In [38]:
len(round_trips)
Out[38]:
3458

3458 of the trips were round trips: the bikes were picked up and dropped off at the same station.
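
To put that count in context, it can be expressed as a share of all trips. The sketch below first shows the mask-and-mean idea on a toy frame with invented station names, then applies the arithmetic to the two counts reported in this notebook (3458 round trips out of 174880 trips after the age filter).

```python
import pandas as pd

# Toy frame: station names are placeholders
toy = pd.DataFrame({
    'start_station_name': ['A', 'B', 'C', 'A'],
    'end_station_name':   ['A', 'C', 'C', 'B'],
})

# The mean of a boolean mask is the fraction of True values
is_round_trip = toy['start_station_name'] == toy['end_station_name']
toy_share = is_round_trip.mean()

# Same idea with the counts reported above
share_pct = round(100 * 3458 / 174880, 2)

print(toy_share)
print(share_pct)
```

So round trips make up roughly 2% of the trips analysed here.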

Q. What is the age distribution by gender for riders who took a round trip?
In [39]:
#Plot an Histogram
fig = px.histogram(round_trips, 
                   x='age', 
                   color='member_gender',
                   marginal='box', 
                   nbins=50, 
                   title='Distribution of Age for Gender')
fig.update_layout(bargap=0.1)
fig.show()
Observation

The average age of males who took round trips is 36, while that of females is 29.

Q. What is the age distribution by user type for riders who took a round trip?
In [40]:
fig = px.histogram(round_trips, 
                   x='age', 
                   color='user_type',
                   marginal='box', 
                   nbins=47, 
                   title='Distribution of Age of User Type')
fig.update_layout(bargap=0.1)
fig.show()
Observation

The average age of customers who took round trips is 30, while that of subscribers is 31.

Q. What is the distribution of round-trip duration by gender?
In [41]:
fig = px.histogram(round_trips, 
                   x='duration_sec', 
                   color='member_gender',
                   marginal='box', 
                   nbins=75, 
                   title='Distribution of Trip Duration for Gender')
fig.update_layout(bargap=0.1)
fig.show()
Observation

The average round-trip duration for females (1114 seconds, or about 18.5 minutes) is greater than that for males (910 seconds, or about 15 minutes).

Q. What is the distribution of round-trip duration by user type?
In [42]:
#plot Histogram
fig = px.histogram(round_trips, 
                   x='duration_sec', 
                   color='user_type',
                   marginal='box', 
                   nbins=75, 
                   title='Distribution of Trip Duration for User Type')
fig.update_layout(bargap=0.1)
fig.show()
Observation

Customers spend an average of 1680.5 seconds (about 28 minutes) on round trips, while subscribers spend an average of 757 seconds (about 12.6 minutes).
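
The figures in this section are read off the interactive plots. They could also be computed directly, which makes it explicit whether an "average" is a mean or a median. A minimal sketch on an invented frame (toy_round_trips and its values are assumptions, not the real round_trips data):

```python
import pandas as pd

# Invented round trips; one very long subscriber trip pulls the mean away from the median
toy_round_trips = pd.DataFrame({
    'user_type': ['Customer', 'Customer', 'Customer',
                  'Subscriber', 'Subscriber', 'Subscriber'],
    'duration_sec': [1200, 1700, 2400, 600, 750, 5000],
    'age': [28, 30, 35, 29, 31, 40],
})

# Median and mean side by side for each group
summary = toy_round_trips.groupby('user_type')[['duration_sec', 'age']].agg(['median', 'mean'])

print(summary)
```

On the real data the same call would be applied to round_trips, grouping by 'user_type' or 'member_gender'.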

Conclusions

  • A large percentage of the riders are male (74%), about three times the share of female riders (23%).

  • Over 90% of the riders in our dataset are subscribers, meaning they pay a subscription fee, which could be monthly or yearly. Only a small percentage of the riders are customers, meaning they pay per trip at the station or kiosk.

  • Female riders take longer trips (median 567 seconds, or approximately 9.5 minutes) than male riders (493 seconds, or about 8 minutes), though the difference is not large.
  • Customers take longer trips (median 780 seconds, or 13 minutes) than subscribers (490 seconds, or about 8 minutes).
  • Riders took longer trips on weekends (Saturday and Sunday).
  • There is no clear linear relationship between age and trip duration. However, most people who took longer trips were between 25 and 45 years old.